A comparative study of the K-means algorithm and the normal mixture model for clustering: Univariate case
Abstract
This paper gives a comparative study of the K-means algorithm and the mixture model (MM) method for clustering normal data. The EM algorithm is used to compute the maximum likelihood estimators (MLEs) of the parameters of the MM model. These parameters include mixing proportions, which may be thought of as the prior probabilities of different clusters; the maximum posterior (Bayes) rule is used for clustering. Hence, asymptotically the MM method approaches the Bayes rule for known parameters, which is optimal in terms of minimizing the expected misclassification rate (EMCR). The paper gives a thorough analytic comparison of the two methods for the univariate case under both homoscedasticity and heteroscedasticity. Simulation results are given to compare the two methods for a range of sample sizes. The comparison, which is limited to two clusters, shows that the MM method has substantially lower EMCR, particularly when the mixing proportions are unbalanced. The two methods have asymptotically the same EMCR under homoscedasticity (resp., heteroscedasticity) when the mixing proportions of the two clusters are equal (resp., not too unequal), but for small samples the MM method sometimes performs slightly worse because of the errors in estimating unknown parameters. © 2007 Elsevier B.V. All rights reserved. MSC: 62H30; 62F10
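The comparison described in the abstract can be illustrated with a minimal sketch, which is not the paper's own code. It assumes two univariate normal clusters, a Lloyd-style K-means started from the sample extremes, and an EM fit started from the K-means solution. The function names (`kmeans_1d`, `em_mixture_1d`, `error_rate`, `simulate`), the initialisation, the fixed iteration counts, and the default simulation settings are all hypothetical choices made for illustration.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def kmeans_1d(xs, iters=100):
    """Two-cluster K-means (Lloyd iterations) on univariate data."""
    c = [min(xs), max(xs)]  # assumed initialisation: the two sample extremes
    labels = []
    for _ in range(iters):
        # Assignment step: each point goes to the nearer center.
        labels = [0 if abs(x - c[0]) <= abs(x - c[1]) else 1 for x in xs]
        # Update step: each center becomes the mean of its members.
        new_c = []
        for k in (0, 1):
            members = [x for x, lab in zip(xs, labels) if lab == k]
            new_c.append(sum(members) / len(members) if members else c[k])
        if new_c == c:
            break
        c = new_c
    return labels, c


def em_mixture_1d(xs, iters=100, equal_var=False):
    """EM for a two-component normal mixture, then the maximum posterior rule."""
    n = len(xs)
    labels, mu = kmeans_1d(xs)  # assumed starting point for EM
    p = [labels.count(0) / n, labels.count(1) / n]
    mean = sum(xs) / n
    sd = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    sigma = [sd, sd]

    def posteriors():
        # Posterior cluster probabilities under the current parameters.
        out = []
        for x in xs:
            d = [p[k] * normal_pdf(x, mu[k], sigma[k]) for k in (0, 1)]
            s = d[0] + d[1]
            out.append((d[0] / s, d[1] / s))
        return out

    for _ in range(iters):
        post = posteriors()  # E-step
        # M-step: weighted updates of proportions, means and variances.
        nk = [sum(w[k] for w in post) for k in (0, 1)]
        p = [nk[k] / n for k in (0, 1)]
        mu = [sum(w[k] * x for w, x in zip(post, xs)) / nk[k] for k in (0, 1)]
        var = [sum(w[k] * (x - mu[k]) ** 2 for w, x in zip(post, xs)) / nk[k]
               for k in (0, 1)]
        if equal_var:  # homoscedastic case: pool the two variances
            pooled = (nk[0] * var[0] + nk[1] * var[1]) / n
            var = [pooled, pooled]
        sigma = [math.sqrt(v) for v in var]

    # Maximum posterior (Bayes) rule with the estimated parameters.
    labels = [0 if w[0] >= w[1] else 1 for w in posteriors()]
    return labels, p, mu, sigma


def error_rate(labels, truth):
    """Misclassification rate, minimised over the two label matchings."""
    miss = sum(a != b for a, b in zip(labels, truth)) / len(truth)
    return min(miss, 1.0 - miss)


def simulate(n=1000, p1=0.9, mu=(0.0, 3.0), sigma=(1.0, 1.0), seed=0):
    """Draw one sample and return (K-means error, MM error)."""
    rng = random.Random(seed)
    truth = [0 if rng.random() < p1 else 1 for _ in range(n)]
    xs = [rng.gauss(mu[t], sigma[t]) for t in truth]
    km_labels, _ = kmeans_1d(xs)
    mm_labels, *_ = em_mixture_1d(xs, equal_var=(sigma[0] == sigma[1]))
    return error_rate(km_labels, truth), error_rate(mm_labels, truth)


if __name__ == "__main__":
    km_err, mm_err = simulate()
    print("K-means error:", km_err, "MM error:", mm_err)
```

Calling `simulate` with different values of `p1` and `n` gives a rough feel for how the two error rates respond to unbalanced mixing proportions and small samples. A single draw is only an empirical error rate, so averaging over many seeds would be needed to approximate the EMCR.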
Similar resources
Repeated Record Ordering for Constrained Size Clustering
One of the main techniques used in data mining is data clustering, which has many applications in computer science, biology, and the social sciences. Constrained clustering is a type of clustering in which side information provided by the user is incorporated into existing clustering algorithms. One of the well-researched constrained clustering algorithms is called microaggregation. In a microaggreg...
Clustering and Ranking University Majors using Data Mining and AHP algorithms: The case of Iran
Abstract: Although all university majors are important and the need for them is not in question, they may not have the same priority given the different resources and strategies of a country. This paper focuses on clustering and ranking university majors in Iran. To do so, a model is presented to clarify the procedure. Eight different criteria are ...
An Improved K-Means with Artificial Bee Colony Algorithm for Clustering Crimes
Crime detection is one of the major issues in the field of criminology. In fact, criminology includes knowing the details of a crime and its intangible relations with the offender. In spite of the enormous amount of data on offenses and offenders, and the complex and intangible semantic relationships between this information, criminology has become one of the most important areas in the field o...
An Optimization K-Modes Clustering Algorithm with Elephant Herding Optimization Algorithm for Crime Clustering
In the past few decades, the detection and prevention of crime required several years of research and analysis. Today, however, thanks to smart systems based on data mining techniques, it is possible to detect and prevent crime in considerably less time. Classification and clustering-based smart techniques can classify and cluster the crime-related samples. The most important factor in the c...
Persistent K-Means: Stable Data Clustering Algorithm Based on K-Means Algorithm
Identifying clusters, or clustering, is an important aspect of data analysis. It is the task of grouping a set of objects in such a way that objects in the same group/cluster are more similar to one another in some sense. It is a main task of exploratory data mining and a common technique for statistical data analysis. This paper proposed an improved version of the K-Means algorithm, namely Persistent K...